Disaster Recovery Framework

ABSTRACT

A system and method of orchestrating failover operations of servers providing services to an internal computer network includes a DR server configured to execute a control script that performs a failover operation. Information needed to perform the failover operation is stored on the DR server, thereby eliminating the need to store agents on each of the application's primary and backup servers. The DR server may provide a centralized location for the maintenance and update of the failover procedures for the internal network's redundant services. A failover operation may be initiated by an authorized user in communication with the internal computer network.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to systems and methods for management of services within a network of computer systems and more specifically to services for server failover within such systems.

2. Description of the Related Art

An organization often requires that certain services that support the organization's mission be available throughout the day. Services may be provided through applications executing on servers connected to the organization's internal network, such as, for example, an intranet. Interruption of the service may adversely affect the operation of the organization. A redundant server system may be used to minimize any service interruption. A redundant server system includes a primary server and a backup server that are both configured to execute the application providing the service. In normal operation, the service is provided by the application executing on the primary server. If the service is interrupted, the backup server can provide the service by starting the application on the backup server. The process of switching from the primary server to the backup server is commonly referred to as a failover. The process of switching between the primary and backup server may be automated by installing software agents on the primary and backup servers that execute the failover process on their respective servers.

A large organization may have tens or hundreds of services that must have high availability and require backup servers and procedures to execute the switch when required. Moreover, the organization may use a variety of servers and applications that each require a different shutdown or startup procedure. Therefore, there remains a need for systems and methods that can manage failover operations across the organization's network from anywhere on the network.

SUMMARY OF THE INVENTION

A system and method of orchestrating failover operations of servers providing services to an internal computer network includes a DR server configured to execute a control script that performs a failover operation. Information needed to perform the failover operation is stored on the DR server, thereby eliminating the need to store agents on each of the application's primary and backup servers. The DR server may provide a centralized location for the maintenance and update of the failover procedures for the internal network's redundant services. A failover operation may be initiated by an authorized user in communication with the internal computer network.

One embodiment of the present invention is directed to a system comprising: a primary server in communication with an internal computer network, the primary server executing an application providing a service to the internal computer network; a backup server in communication with the internal computer network, the backup server capable and configured to execute the application; a DR server in communication with the internal computer network; and a failover script stored on the DR server, the failover script performing a failover operation on the backup server when executed on the DR server.

Another embodiment of the present invention is directed to a method of orchestrating a failover operation from a DR server in communication with an internal computer network, a primary server and a backup server, the primary server and backup server configured to run an application that provides a service to the internal computer network, the method comprising: receiving a command through the internal computer network from a user to perform a failover operation for the application; retrieving a security ticket from the primary server based on the user; reading a configuration file stored on the DR server, the configuration file containing information for the failover operation of the application; and executing a failover operation of the application based on the information read from the configuration file.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will be described by reference to the preferred and alternative embodiments thereof in conjunction with the drawings in which:

FIG. 1 is a diagram illustrating a computer network used in some embodiments of the present invention;

FIG. 2 is a flow diagram illustrating an embodiment of the present invention; and

FIG. 3 shows a portion of a configuration file used in some embodiments of the present invention.

DETAILED DESCRIPTION

An embodiment of the present invention provides for the orchestration of disaster recovery activation for services provided on an organization's internal computer network. A log of the recovery process is generated and stored for later review of the status of the recovery process. The recovery operation is preferably orchestrated by a control script that contains subroutines that execute portions of the recovery process on a target server. Target-specific information for the recovery process may be stored in a configuration file. The control script may be activated via a command-line interface or via a web front-end that is accessed through the organization's internal computer network. Embodiments of the present invention do not require modification of the applications providing the service or installation of agents on the target servers.

FIG. 1 is a diagram illustrating a computer network used in some embodiments of the present invention. In a preferred embodiment, an external computer 115 may access an organization's computer network 120 via an external communications network 110 such as, for example, the internet. A gateway server 130 provides a bridge between the external network 110 and the organization's internal computer network 150. In a preferred embodiment, the internal computer network 150 is an intranet. The gateway server 130 also provides security to computer network 120 by preventing unauthorized access to network 120. The structure and operation of computer networks are known and described in numerous publications such as, for example, Craig Zacker, Networking: The Complete Reference, The McGraw-Hill Companies, Berkeley, Calif. (2001), incorporated herein by reference.

Users may access the resources and services of the computer network 120 through a computer 140 that is directly connected to the intranet 150 or through external computer 115 via the internet 110. Services are provided by applications executing on one or more servers. In the illustrative example of FIG. 1, a service 170 is provided by primary servers 172 and 174. Each primary server 172 and 174 may execute a portion of an application providing the service 170. The organization may consider service 170 sufficiently important to provide backup servers 182 and 184 that are capable of providing the service if the service from the primary servers is interrupted. In some embodiments, the backup servers 182 and 184 are located in a different geographical region, at what is usually referred to as a failover site 180.

The process of switching servers providing a service is generally referred to as a failover process. In some embodiments of the present invention, the failover process may include three types of failover operations that each cover a possible disaster situation.

In a first situation, herein referred to as a migration, both the primary and failover sites are available and the service is switched from the primary site to the failover site. During the migration operation, the application providing the service at the primary site is first shut down, followed by any necessary data replication before the application is started at the failover site.

In a second situation, herein referred to as a takeover, the primary site is unavailable, thereby preventing an orderly shutdown of the application at the primary site or any necessary data replication to the failover site. The primary site may become unavailable for a variety of reasons such as, for example, a power loss at the primary site, interruption of a communication link between the primary site and the organization's intranet, or physical damage to the servers or data storage devices at the primary site. During the takeover operation, the application is started at the failover site.

In a third situation, herein referred to as a failback, the service is being provided by the failover site and the service is switched back to the primary site. During the failback operation, the application providing the service at the failover site is first shut down, followed by any necessary data replication before the application is started at the primary site.
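
The three operations described above can be summarized as ordered phases. The following is a minimal sketch only; the names Operation and plan_phases, the phase labels, and the site labels are hypothetical and are not taken from any embodiment.

```python
from enum import Enum


class Operation(Enum):
    """The three failover operations described in the text."""
    MIGRATE = "migrate"
    TAKEOVER = "takeover"
    FAILBACK = "failback"


def plan_phases(operation, primary="primary", failover="failover"):
    """Return the ordered (phase, site) pairs for an operation.

    Phase names are illustrative labels, not commands from the source.
    """
    if operation is Operation.MIGRATE:
        # Both sites available: orderly shutdown, replicate, then start.
        return [("shutdown", primary), ("replicate", failover), ("start", failover)]
    if operation is Operation.TAKEOVER:
        # Primary site unavailable: no shutdown or replication is possible.
        return [("start", failover)]
    if operation is Operation.FAILBACK:
        # Mirror image of a migration, back toward the primary site.
        return [("shutdown", failover), ("replicate", primary), ("start", primary)]
    raise ValueError(f"unknown operation: {operation!r}")
```

In this sketch a takeover has a single phase because the unavailable primary site rules out both the orderly shutdown and the data replication.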

Authorization to initiate a failover is tightly controlled and is usually vested in only a few authorized managers. When one of the authorized managers determines that a failover is necessary, he or she enters the commands necessary to accomplish the failover process. In order to reduce error, the series of commands are printed in a disaster recovery manual that is accessible to the authorized manager. The failover operation may require that the authorized manager log onto several different servers to complete the failover operation. For example, in addition to logging onto the primary and backup servers, the authorized manager may also require access to a server managing the domain name service (DNS) for the organization's computer network and to a server managing a storage area network (SAN) for the organization's computer network.

In a preferred embodiment, the series of commands are stored as a script on a DR server 160. The DR server 160 preferably stores a failover script for each service that has a failover site. In some embodiments, DR server 160 may act as a central depository for recovery scripts for a region, thereby providing for easier maintenance and updates of the recovery scripts.

FIG. 2 is a flowchart illustrating the failover process. In a preferred embodiment, a control script manages the failover process and calls other scripts or subroutines that execute target-specific procedures on the target server. After the control script is activated by the authorized manager, step 210 checks to confirm that the script is running as a correct user on the correct target host or server. In some embodiments, only specific user/host combinations are allowed to execute the failover procedure. If the user/host combination is invalid, the script terminates, logs the result in a log file, and displays the result to the authorized manager. If the user/host combination is valid, the script checks the validity of any arguments or options specified with the invocation of the control script. For example, an action option may be specified when the control script is initiated. The action option identifies the operation to be performed by the control script and should therefore specify a valid operation. The control script confirms that the action option specifies one of the valid operations in step 210. If the action option is invalid, the script terminates, records the results in a log file and displays the result to the authorized manager.

In step 220, a configuration file for the application is read and verified. The configuration file is verified by comparing the configuration file to a template file that reflects the rules for valid configuration data. If the configuration file contains invalid data, the script terminates, records the result in a log file, and displays the result to the authorized manager.
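
The source does not describe the format of the template file. As one possible reading, the template could map each permitted key to a pattern that its value must match. The sketch below assumes that reading; the keys primary and failover, the patterns, and the function verify_config are hypothetical.

```python
import re

# Hypothetical template: each permitted key maps to a pattern for its value.
TEMPLATE_RULES = {
    "primary": re.compile(r"^[A-Za-z0-9.-]+$"),
    "failover": re.compile(r"^[A-Za-z0-9.-]+$"),
    "onFail": re.compile(r"^(DIE|WARN|RETRY(,\s*\d+\s+(DIE|WARN))?)$"),
}


def verify_config(entries, rules=TEMPLATE_RULES):
    """Compare configuration entries to the template rules.

    Returns a list of error strings; an empty list means the data is valid.
    """
    errors = []
    for key, pattern in rules.items():
        if key not in entries:
            errors.append(f"missing key: {key}")
        elif not pattern.match(entries[key]):
            errors.append(f"invalid value for {key}: {entries[key]!r}")
    for key in entries:
        if key not in rules:
            errors.append(f"unknown key: {key}")
    return errors
```

A non-empty result corresponds to the invalid-data branch of step 220, in which the script logs the result, displays it, and terminates.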

If the configuration file contains valid data, an authentication and authorization procedure is performed in step 230 before control script commands are executed on the target host. Authentication and authorization may follow any of the known security procedures for networks. In a preferred embodiment, authentication and authorization are accomplished using a Kerberos protocol described in RFC 1510 available from http://www.freesoft.org/CIE/RFC/1510/ (September 1993), herein incorporated by reference. A Kerberos ticket for the application providing the service is stored on the DR server for each authorized user. Each primary and backup server stores a file containing a list of Kerberos tickets that it will accept. Each Kerberos ticket allows only the specific user/host/application combination to establish a secure channel with the target host.

Once the secure channel is established between the target host and the DR server, the DR server transmits a script command to the target host for execution on the target host in step 240. The target host returns a signal to the DR server indicating a status of the executed command, i.e., whether the script command was successfully executed or failed. The DR server checks the returned signal in step 250. If the returned signal indicates a successful execution of the command, the DR server determines if the executed command was the last command in the script in step 255. If the executed command is the last command, the DR server records the result, displays the result to the user, and terminates the script. If the executed command is not the last command, the script branches back to step 240 to execute the next script command.

If the returned signal indicates an unsuccessful execution of the command, the DR server examines an onFail option associated with the command in step 260. If the onFail option is set to DIE, the DR server prints an error message to the log file in step 280, displays the error message to the user, and terminates the script in step 290. If the onFail option is set to WARN, the DR server prints an error message to the log file in step 290 and branches back to step 240 to execute the next script command. If the onFail option is set to RETRY, the DR server re-executes the command in step 265 before branching to step 250 to determine if the re-executed command was successfully executed. The RETRY flag may be followed by a repeat number and a DIE or WARN flag. For example, if onFail=RETRY, 2 DIE, the DR server will resend the command to the target host for re-execution twice and if the command is still unsuccessful after the second retry, the DR server will branch according to the DIE flag.
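
The command loop of steps 240 through 290 and the onFail handling might be sketched as follows. The functions parse_on_fail and run_script and the send callable are hypothetical; send stands in for transmitting a command over the secure channel and returns True on success. The behavior of a bare RETRY with no repeat number or final flag is not stated in the source, so the defaults below (one retry, then DIE) are assumptions.

```python
def parse_on_fail(value):
    """Parse 'DIE', 'WARN' or 'RETRY, n DIE|WARN' into (retries, final_action)."""
    parts = value.replace(",", " ").split()
    if not parts:
        raise ValueError("empty onFail value")
    if parts[0] in ("DIE", "WARN") and len(parts) == 1:
        return 0, parts[0]
    if parts[0] == "RETRY" and len(parts) <= 3:
        retries = int(parts[1]) if len(parts) > 1 else 1  # assumed default
        final = parts[2] if len(parts) > 2 else "DIE"     # assumed default
        if final in ("DIE", "WARN"):
            return retries, final
    raise ValueError(f"invalid onFail value: {value!r}")


def run_script(commands, send, log):
    """Execute (command, on_fail) pairs in order.

    Returns True if the script reached its last command, or False if a
    failure with a DIE flag terminated it. Error messages go to `log`.
    """
    for command, on_fail in commands:
        retries, final = parse_on_fail(on_fail)
        ok = send(command)
        attempts = 0
        while not ok and attempts < retries:
            attempts += 1
            ok = send(command)  # re-execute the failed command
        if ok:
            continue
        log.append(f"ERROR: {command} failed")
        if final == "DIE":
            return False  # terminate the script
        # WARN: fall through and continue with the next command
    return True
```

With onFail=RETRY, 2 DIE, a command that always fails is sent three times in this sketch: the initial attempt plus two retries, after which the DIE flag ends the script.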

FIG. 3 shows a portion of a configuration file that may be used in some embodiments of the present invention. In a preferred embodiment, the configuration file is a plain text file in a key-value format containing a target 312, a key 314 and a paired value 316. Each target contains an onFail key that describes an action to take if the subroutine for the target fails. In FIG. 3, a MIGRATE target 350 is shown with integer keys, each of which corresponds to a script step that is executed when a migration operation is selected for the failover process. In the example shown in FIG. 3, assuming that the authorized manager has selected the migrate operation and after a secure channel is established, the DR server executes the first script command in the MIGRATE target, which in this example is cname->delete(CNAME1). The cname module requires five parameters that identify the primary host, the failover host, an alias, a user name, and a password. In the example shown in FIG. 3, the steps 1-4 switch the alias names between the primary server and the failover server by first deleting the alias from the primary and failover server (steps 1-2) and adding the new alias to the primary and failover server (steps 3-4). The user name and password specified in CNAME1 310 and CNAME2 320 allow the authorized manager to log onto the organization's DNS server that manages the domain names for the organization's servers. Both CNAME1 310 and CNAME2 320 specify that onFail=WARN, which is used by the control script to determine an action if the command is not successfully executed.
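
The exact syntax of the file in FIG. 3 is not reproduced in the text. Purely for illustration, the sketch below assumes an INI-like layout in which each target is a bracketed section; that layout, the sample contents, and the functions load_config and ordered_steps are assumptions. Only the command strings cname->delete(CNAME1) and srdf->failover(SRDF) and the onFail=WARN setting of CNAME1 come from the description.

```python
import configparser

# Illustrative sample only; the real file layout is not given in the text.
SAMPLE_CONFIG = """
[CNAME1]
onFail = WARN

[MIGRATE]
2 = cname->delete(CNAME2)
1 = cname->delete(CNAME1)
7 = srdf->failover(SRDF)
onFail = DIE
"""


def load_config(text):
    """Parse target/key/value data from INI-like text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve key case, e.g. "onFail"
    parser.read_string(text)
    return parser


def ordered_steps(config, target):
    """Return the script commands of a target, ordered by integer key."""
    numbered = [(int(key), value)
                for key, value in config[target].items()
                if key.isdigit()]
    return [command for _, command in sorted(numbered)]
```

Sorting on the integer key rather than on file order means that the steps run in sequence even if they were written out of order, as steps 1 and 2 are in the sample.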

In step 5, the DR server sends a command to the primary server to dismount the application's file system directory and in step 6, the DR server sends a command to the primary server to deport the application's disk group. In step 7, the DR server executes the command, srdf->failover(SRDF), that switches the state of the primary and secondary storage to allow the secondary storage to be mounted for the failover host. The srdf module uses two parameters that are defined in the SRDF module 330 that identify a gatekeeper host that manages the primary and failover storage devices and defines the specific storage devices that are switched. The particular commands in the srdf module depend on the SAN manager used to control the primary and failover storage devices. In step 8, the DR server sends a command to the failover server to import the application's disk group on the failover server. In step 9, the DR server sends a command to the failover server to mount the application's file system on the failover server.
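
Script commands such as cname->delete(CNAME1) and srdf->failover(SRDF) follow a module->method(TARGET) shape. A control script could split such a string into its parts before dispatching it, as in the minimal sketch below. The pattern, the ScriptCommand type, and parse_command are hypothetical, and the sketch assumes that each part consists only of letters, digits and underscores.

```python
import re
from typing import NamedTuple

# Assumed shape: module->method(TARGET), each part made of word characters.
COMMAND_RE = re.compile(r"^\s*(\w+)->(\w+)\((\w+)\)\s*$")


class ScriptCommand(NamedTuple):
    module: str   # e.g. "srdf"
    method: str   # e.g. "failover"
    target: str   # name of the target holding the parameters, e.g. "SRDF"


def parse_command(text):
    """Split a script command string into module, method and target."""
    match = COMMAND_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized script command: {text!r}")
    return ScriptCommand(*match.groups())
```

The target name returned here would then be used to look up the parameters, such as the gatekeeper host and storage devices for the srdf module, in the configuration file.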

Having thus described at least illustrative embodiments of the invention, various modifications and improvements will readily occur to those skilled in the art and are intended to be within the scope of the invention. Accordingly, the foregoing description is by way of example only and is not intended as limiting. The invention is limited only as defined in the following claims and the equivalents thereto.

What is claimed:
 1. A system comprising: a primary server in communication with an internal computer network, the primary server executing an application providing a service to the internal computer network; a backup server in communication with the internal computer network, the backup server configured to execute the application; a DR server in communication with the internal computer network; and a failover script stored on the DR server, the failover script performing a failover operation on the backup server when executed on the DR server.
 2. The system of claim 1 wherein the internal computer network is an intranet.
 3. The system of claim 1 wherein the failover operation comprises a migration of the service from the primary server to the backup server.
 4. The system of claim 1 wherein the failover operation comprises a takeover of the service by the backup server.
 5. The system of claim 1 wherein the failover operation comprises a failback of the service from the backup server to the primary server.
 6. The system of claim 1 wherein the failover script is initiated to begin execution on the DR server from a computer in communication with the internal computer network.
 7. A method of orchestrating a failover operation from a DR server in communication with an internal computer network, a primary server and a backup server, the primary server and backup server configured to execute an application that provides a service to the internal computer network, the method comprising: receiving a command through the internal computer network from a user to perform a failover operation for the application; retrieving a security ticket from the primary server based on the user; reading a configuration file stored on the DR server, the configuration file containing information for the failover operation for the application; and executing the failover operation for the application based on the information read from the configuration file.
 8. The method of claim 7 wherein the step of executing further comprises: logging on to a DNS server providing a domain name service to the internal computer network; and switching a DNS alias of the primary server with a DNS alias of the backup server.
 9. The method of claim 7 wherein the step of executing further comprises: logging on to a gatekeeper host providing storage area network management service to the internal computer network; and switching a state of a primary storage with a state of a secondary storage thereby enabling mounting of the secondary storage for the backup server.
 10. The method of claim 7 wherein the step of executing further comprises executing a script command from the configuration file.
 11. The method of claim 10 wherein the step of executing further comprises: receiving a return signal indicating a status of the executed script command; recording the status in a log file; and displaying the status to the user.
 12. The method of claim 11 wherein the script command is re-executed if the status of the executed script command indicates failure.
 13. The method of claim 11 wherein if the status of the executed script command indicates failure, a next script command from the configuration file is executed. 